Abstract


We develop a worst-case analysis of aggregation of classifier ensembles for binary classification.
The task of predicting to minimize error is formulated as a game played over a given set of
unlabeled data (a transductive setting), where prior label information is encoded as constraints on 
the game. The minimax solution of this game identifies cases where a weighted combination of the 
classifiers can perform significantly better than any single classifier. 

Keywords: Ensemble aggregation, transductive, minimax 

1. Introduction 

Suppose that we have a finite set, or ensemble, of binary classifiers $\mathcal{H} = \{h_1, h_2, \dots, h_p\}$, with
each $h_i$ mapping data in some space $\mathcal{X}$ to a binary prediction $\{-1, +1\}$. Examples $(x, y) \in \mathcal{X} \times \{-1, +1\}$
are generated i.i.d. according to some fixed but unknown distribution $\mathcal{D}$, where $y \in \{-1, +1\}$
is the class label of the example. We write the expectation with respect to $\mathcal{D}$ or one of its
marginal distributions as $\mathbb{E}_{\mathcal{D}}[\cdot]$.

Consider a statistical learning setting, in which we assume access to two types of i.i.d. data:
a small set of labeled training examples $S = \{(x'_1, y'_1), \dots, (x'_m, y'_m)\}$ drawn from $\mathcal{D}$ and a much
larger set of unlabeled test examples $U = \{x_1, \dots, x_n\}$ drawn i.i.d. according to the marginal
distribution over $\mathcal{X}$ induced by $\mathcal{D}$. A typical use of the labeled set is to find an upper bound on the
expected error rate of each of the classifiers in the ensemble. Specifically, we assume a set of lower
bounds $b_1, \dots, b_p > 0$ such that the correlation $\mathrm{corr}(h_i) := \mathbb{E}_{\mathcal{D}}[y\, h_i(x)]$ satisfies $\mathrm{corr}(h_i) \ge b_i$.

If we ignore the test set, then the best we can do, in the worst case, is to use the classifier with 
the largest correlation (smallest error). This corresponds to the common practice of empirical risk 
minimization (ERM). However, in many cases we can glean useful information from the distribution 
of the test set that will allow us to greatly improve over ERM. 

We motivate this statement by contrasting two simple prediction scenarios, A and B. In both 
cases there are $p = 3$ classifiers and $n = 3$ unlabeled test examples. The correlation vector is
$\mathbf{b} = (1/3, 1/3, 1/3)$; equivalently, the classifier error rates are 33%. Based on that information, the
predictor knows that each classifier makes two correct predictions and one incorrect prediction. 

So far, both cases are the same. The difference is in the relations between different predictions 
on the same example. In case A, each example has two predictions that are the same, and a third 
that is different. In this case it is apparent that the majority vote over the three classifiers has to be 
correct on all 3 examples, i.e. we can reduce the error from 1/3 to 0. In case B, all three predictions
are equal for all examples. In other words, the three classification rules are exactly the same on the 
three examples, so there is no way to improve over any single rule. 
These cases show that there is information in the unlabeled test examples that can be used to
reduce error - indeed, cases A and B can be distinguished without using any labeled examples. 
In this paper, we give a complete characterization of the optimal worst-case (minimax) predictions 
given the correlation vector b and the unlabeled test examples. 
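
The contrast between cases A and B can be checked by brute force. The following Python sketch is illustrative only: the matrices `F_A` and `F_B`, the helper name `worst_case_majority_error`, and the choice of writing classifiers as rows and examples as columns are assumptions made for this example. It enumerates every ±1 labeling that is consistent with the correlation bounds and reports the worst error the unweighted majority vote can suffer.

```python
import itertools

def worst_case_majority_error(F, b):
    """Worst error of the majority vote over all +/-1 labelings z
    that satisfy (1/n) * F z >= b for every classifier (row of F)."""
    p, n = len(F), len(F[0])
    # Unweighted majority vote on each example (column of F); p is odd here.
    vote = [1 if sum(F[i][j] for i in range(p)) > 0 else -1 for j in range(n)]
    worst = 0.0
    for z in itertools.product([-1, 1], repeat=n):
        corr = [sum(F[i][j] * z[j] for j in range(n)) / n for i in range(p)]
        if all(c >= bi - 1e-12 for c, bi in zip(corr, b)):
            err = sum(v != zj for v, zj in zip(vote, z)) / n
            worst = max(worst, err)
    return worst

b = [1 / 3, 1 / 3, 1 / 3]
# Case A: on each example exactly one classifier disagrees with the other two.
F_A = [[-1, 1, 1],
       [1, -1, 1],
       [1, 1, -1]]
# Case B: all three classifiers make identical predictions.
F_B = [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]

err_A = worst_case_majority_error(F_A, b)
err_B = worst_case_majority_error(F_B, b)
print(err_A, err_B)
```

In this toy instance the only labeling consistent with case A is the one the majority vote predicts, so its worst-case error is 0, whereas in case B the vote can still be wrong on one of the three examples.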

Our development does not consider the feature space $\mathcal{X}$ directly, but instead models the knowledge
of the $p$ ensemble predictions on $U$ with a $p \times n$ matrix that we denote by $\mathbf{F}$. Our focus is
on how to use the matrix F in conjunction with the correlation vector b, to make minimax optimal 
predictions on the test examples. 

The rest of the paper is organized as follows. In Section 2 we introduce some additional notation. 
In Section 3 we formalize the above intuition as a zero-sum game between the predictor and an 
adversary, and solve it, characterizing the minimax strategies for both sides by minimizing a convex 
objective we call the slack function. This solution is then linked to a statistical learning algorithm 
in Section 4. 

In Section 5, we interpret the slack function and the minimax strategies, providing more toy 
examples following the one given above to build intuition. In Section 6 we focus on computational 
issues in running the statistical learning algorithm. After discussing relations to other work in 
Section 7, we conclude in Section 8. 


2. Preliminaries 


The main tools we use in this paper are linear programming and uniform convergence. We therefore 
use a combination of matrix notation and the probabilistic notation given in the introduction. The 
algorithm is first described in a deterministic context where some inequalities are assumed to hold; 
probabilistic arguments are used to show that these assumptions are correct with high probability. 
The ensemble’s predictions on the unlabeled data are denoted by F: 


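$$\mathbf{F} = \begin{pmatrix} h_1(x_1) & h_1(x_2) & \cdots & h_1(x_n) \\ \vdots & \vdots & \ddots & \vdots \\ h_p(x_1) & h_p(x_2) & \cdots & h_p(x_n) \end{pmatrix} \qquad (1)$$

The true labels on the test data $U$ are represented by $\mathbf{z} = (z_1; \dots; z_n) \in [-1, 1]^n$.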

Note that we allow $\mathbf{F}$ and $\mathbf{z}$ to take any value in the range $[-1, 1]$ rather than just the two
endpoints. This relaxation does not change the analysis, because intermediate values can be interpreted
as the expected value of randomized predictions. For example, a value of $\frac{1}{2}$ indicates
$\{+1 \text{ w.p. } \frac{3}{4},\ -1 \text{ w.p. } \frac{1}{4}\}$. This interpretation extends to our definition of the correlation on the test
set, $\widehat{\mathrm{corr}}_U(h_i) = \frac{1}{n} \sum_{j=1}^{n} h_i(x_j) z_j$.$^1$

The labels $\mathbf{z}$ are hidden from the predictor, but we assume the predictor has knowledge of a
correlation vector $\mathbf{b} \ge \mathbf{0}^p$ such that $\widehat{\mathrm{corr}}_U(h_i) \ge b_i$ for all $i \in [p]$, i.e. $\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}$. From our
development so far, the correlation vector’s components $b_i$ each correspond to a constraint on the
corresponding classifier’s test error $\frac{1}{2}(1 - b_i)$.

The following notation is used throughout the paper: $[a]_+ = \max(0, a)$ and $[a]_- = [-a]_+$,
$[n] = \{1, 2, \dots, n\}$, $\mathbf{1}^n = (1; 1; \dots; 1) \in \mathbb{R}^n$, and $\mathbf{0}^n$ similarly. Also, write $\mathbf{I}_n$ as the $n \times n$ identity
matrix. All vector inequalities are componentwise. The probability simplex in $d$ dimensions is
denoted by $\Delta^d = \{\sigma \ge \mathbf{0}^d : \sum_{i=1}^{d} \sigma_i = 1\}$. Finally, we use vector notation for the rows and
columns of $\mathbf{F}$: $\mathbf{h}_i = (h_i(x_1), h_i(x_2), \dots, h_i(x_n))^\top$ and $\mathbf{x}_j = (h_1(x_j), h_2(x_j), \dots, h_p(x_j))^\top$.

1. We are slightly abusing the term “correlation” here. Strictly speaking this is just the expected value of the product,
without standardizing by mean-centering and rescaling for unit variance. We prefer this to inventing a new term.


3. The Transductive Binary Classification Game 

We now describe our prediction problem, and formulate it as a zero-sum game between two players: 
a predictor and an adversary. 

In this game, the predictor is the first player, who plays $\mathbf{g} = (g_1; g_2; \dots; g_n)$, a randomized label
$g_i \in [-1, 1]$ for each example $x_i$. The adversary then plays, setting the labels $\mathbf{z} \in [-1, 1]^n$
under ensemble test error constraints defined by $\mathbf{b}$. The predictor’s goal is to minimize (and the
adversary’s to maximize) the worst-case expected classification error on the test data (w.r.t. the
randomized labelings $\mathbf{z}$ and $\mathbf{g}$): $\frac{1}{2}\left(1 - \frac{1}{n}\mathbf{z}^\top\mathbf{g}\right)$. This is equivalently viewed as maximizing
worst-case correlation $\frac{1}{n}\mathbf{z}^\top\mathbf{g}$.

To summarize concretely, we study the following game: 

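$$V := \max_{\mathbf{g} \in [-1,1]^n} \ \min_{\substack{\mathbf{z} \in [-1,1]^n, \\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}}} \ \frac{1}{n}\mathbf{z}^\top\mathbf{g} \qquad (2)$$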

It is important to note that we are only modeling “test-time” prediction, and represent the information
gleaned from the labeled data by the parameter $\mathbf{b}$. Inferring the vector $\mathbf{b}$ from training data
is a standard application of Occam’s Razor (Blumer et al. (1987)), which we provide in Section 4.

The minimax theorem (e.g. Cesa-Bianchi and Lugosi (2006), Theorem 7.1) applies to the game 
(2), since the constraint sets are convex and compact and the payoff linear. Therefore, it has a 
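minimax equilibrium and associated optimal strategies $\mathbf{g}^*, \mathbf{z}^*$ for the two sides of the game, i.e.
$\min_{\mathbf{z}} \frac{1}{n}\mathbf{z}^\top\mathbf{g}^* = V = \max_{\mathbf{g}} \frac{1}{n}\mathbf{z}^{*\top}\mathbf{g}$.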

As we will show, both optimal strategies are simple functions of a particular weighting over the 
p hypotheses - a nonnegative p-vector. Define this weighting as follows. 

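Definition 1 (Slack Function and Optimal Weighting) Let $\sigma \ge \mathbf{0}^p$ be a weight vector over $\mathcal{H}$
(not necessarily a distribution). The vector of ensemble predictions is $\mathbf{F}^\top\sigma = (\mathbf{x}_1^\top\sigma, \dots, \mathbf{x}_n^\top\sigma)$,
whose elements' magnitudes are the margins. The prediction slack function is

$$\gamma(\sigma, \mathbf{b}) = \gamma(\sigma) := \frac{1}{n}\sum_{j=1}^{n} \left[\, |\mathbf{x}_j^\top\sigma| - 1 \,\right]_+ - \mathbf{b}^\top\sigma \qquad (3)$$

An optimal weight vector $\sigma^*$ is any minimizer of the slack function: $\sigma^* \in \arg\min_{\sigma \ge \mathbf{0}^p} [\gamma(\sigma)]$.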

Our main result uses these to describe the solution of the game (2). 


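Theorem 2 (Minimax Equilibrium of the Game) The minimax value of the game (2) is $V = -\gamma(\sigma^*)$.
The minimax optimal strategies are defined as follows: for all $i \in [n]$,

$$g_i^* = g_i(\sigma^*) = \begin{cases} \mathbf{x}_i^\top\sigma^* & |\mathbf{x}_i^\top\sigma^*| < 1 \\ \mathrm{sgn}(\mathbf{x}_i^\top\sigma^*) & \text{otherwise} \end{cases} \qquad \text{and} \qquad z_i^* = \begin{cases} 0 & |\mathbf{x}_i^\top\sigma^*| < 1 \\ \mathrm{sgn}(\mathbf{x}_i^\top\sigma^*) & |\mathbf{x}_i^\top\sigma^*| > 1 \end{cases} \qquad (4)$$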
Figure 1: The optimal strategies and slack function as a function of the ensemble prediction $\mathbf{x}^\top\sigma^*$ (horizontal axis: ensemble average prediction).


The proof of this theorem is a standard application of Lagrange duality and the minimax theorem.
The minimax value of the game and the optimal strategy for the predictor $\mathbf{g}^*$ (Lemma 8) are
our main objects of study and are completely characterized, and the theorem’s partial description of
$\mathbf{z}^*$ (proved in Lemma 9) will suffice for our purposes.$^2$

Theorem 2 illuminates the importance of the optimal weighting $\sigma^*$ over hypotheses. This
weighting $\sigma^* \in \arg\min_{\sigma \ge \mathbf{0}^p} \gamma(\sigma)$ is the solution to a convex optimization problem (Lemma 11),
and therefore we can efficiently compute it and $\mathbf{g}^*$ to any desired accuracy. The ensemble prediction
(w.r.t. this weighting) on the test set is $\mathbf{F}^\top\sigma^*$, which is the only dependence of the solution on $\mathbf{F}$.

More specifically, the minimax optimal prediction and label (4) on any test set example $x_j$
can be expressed as functions of the ensemble prediction $\mathbf{x}_j^\top\sigma^*$ on that test point alone, without
considering the others. The $\mathbf{F}$-dependent part of the slack function also depends separately on each
test point’s ensemble prediction. Figure 1 depicts these three functions.
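
To make the three per-example functions concrete, here is a minimal Python sketch of the slack function (3) and the strategies in (4) for a given weighting. The function names, the list-of-rows layout of `F`, and the small 3-by-3 example are assumptions made for illustration, not part of the paper's development; the weighting passed in is arbitrary rather than an optimizer.

```python
def clip(v):
    """Clip a real number to the interval [-1, 1]."""
    return max(-1.0, min(1.0, v))

def ensemble_predictions(F, sigma):
    """F^T sigma: one weighted vote per example (column of F)."""
    p, n = len(F), len(F[0])
    return [sum(F[i][j] * sigma[i] for i in range(p)) for j in range(n)]

def slack(F, b, sigma):
    """gamma(sigma) = (1/n) sum_j [|x_j^T sigma| - 1]_+  -  b^T sigma."""
    margins = ensemble_predictions(F, sigma)
    penalty = sum(max(0.0, abs(m) - 1.0) for m in margins) / len(margins)
    return penalty - sum(bi * si for bi, si in zip(b, sigma))

def predictor_strategy(F, sigma):
    """g(sigma): the ensemble prediction clipped to [-1, 1]."""
    return [clip(m) for m in ensemble_predictions(F, sigma)]

def adversary_strategy(F, sigma):
    """z(sigma) as in (4): 0 on hedged examples, the sign on clipped ones.
    Borderline examples (|x^T sigma| = 1) are returned as None because
    (4) does not specify them."""
    out = []
    for m in ensemble_predictions(F, sigma):
        if abs(m) < 1:
            out.append(0.0)
        elif abs(m) > 1:
            out.append(1.0 if m > 0 else -1.0)
        else:
            out.append(None)
    return out

# Hypothetical 3-classifier, 3-example instance (rows are classifiers).
F = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]
b = [1 / 3, 1 / 3, 1 / 3]
hedged = [0.5, 0.5, 0.5]   # every ensemble prediction is 0.5
clipped = [2, 2, 2]        # every ensemble prediction is 2
print(slack(F, b, hedged), predictor_strategy(F, hedged), adversary_strategy(F, hedged))
print(slack(F, b, clipped), predictor_strategy(F, clipped), adversary_strategy(F, clipped))
```

With the hedged weighting the predictor plays the raw ensemble prediction and the adversary plays 0; with the clipped weighting both play the sign.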

4. Bounding the Correlation Vector 

In the analysis presented above we assumed that a correlation vector b is given, and that each 
component is guaranteed to be a lower bound on the test correlation of the corresponding hypothesis. 
In this section, we show how b can be calculated from a labeled training set. 

The algorithm that we use is a natural one using uniform convergence: we compute the empirical
correlations for each of the $p$ classifiers, and add a uniform penalty term to guarantee that $\mathbf{b}$ is a
lower bound on the correlation of the test data. For each classifier, we consider three quantities:

• The true correlation: $\mathrm{corr}(h) = \mathbb{E}_{\mathcal{D}}[y\, h(x)]$

• The correlation on the training set of labeled data: $\widehat{\mathrm{corr}}_S(h) = \frac{1}{m}\sum_{i=1}^{m} h(x'_i)\, y'_i$

• The correlation on the test set of unlabeled data: $\widehat{\mathrm{corr}}_U(h) = \frac{1}{n}\sum_{i=1}^{n} h(x_i)\, z_i$

2. For completeness, Corollary 10 in the appendices specifies $z_i^*$ when $|\mathbf{x}_i^\top\sigma^*| = 1$.


Minimax Classifier Aggregation 


Using Chernoff bounds, we can show that the training and test correlations are concentrated near
the true correlation. Specifically, for each individual classifier $h$ we have the two inequalities

$\Pr\left(\widehat{\mathrm{corr}}_S(h) > \mathrm{corr}(h) + \epsilon_S\right) \le$

$\Pr\left(\widehat{\mathrm{corr}}_U(h) < \mathrm{corr}(h) - \epsilon_U\right) \le$

Let $\delta$ denote the probability we allow for failure. If we set $\epsilon_S = \epsilon_U =$

we are guaranteed that all the $2p$ inequalities hold concurrently with probability at least $1 - \delta$.

We thus set the correlation bound to:

$$b_i := \widehat{\mathrm{corr}}_S(h_i) - \epsilon_S - \epsilon_U$$

and have with probability $\ge 1 - \delta$ that $\mathbf{b}$ is a good correlation vector, i.e. $\widehat{\mathrm{corr}}_U(h_i) \ge b_i \ \forall i \in [p]$.
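
As an illustration of this recipe, the sketch below computes a correlation vector from labeled data. The deviation terms are an assumption of this example, not the constants used in the paper: they come from the one-sided Hoeffding inequality for averages of terms in $[-1, 1]$, with the failure probability split evenly over the $2p$ events. The function name and argument layout are likewise hypothetical.

```python
import math

def correlation_lower_bounds(train_preds, train_labels, n_test, delta):
    """Return (b, eps_S, eps_U), where b[i] = corr_S(h_i) - eps_S - eps_U.

    train_preds[i][k] is classifier i's prediction on labeled example k,
    train_labels[k] is that example's label, n_test is the number of
    unlabeled test examples, and delta is the allowed failure probability.
    """
    p = len(train_preds)
    m = len(train_labels)
    # Assumed bound: for an average of N independent terms in [-1, 1],
    #   Pr(deviation > eps) <= exp(-N * eps**2 / 2)   (Hoeffding).
    # Setting the right-hand side to delta / (2p) gives the values below.
    log_term = math.log(2 * p / delta)
    eps_S = math.sqrt(2 * log_term / m)
    eps_U = math.sqrt(2 * log_term / n_test)
    b = []
    for preds in train_preds:
        corr_S = sum(h * y for h, y in zip(preds, train_labels)) / m
        b.append(corr_S - eps_S - eps_U)
    return b, eps_S, eps_U

# Hypothetical toy data: two classifiers, four labeled examples.
toy_preds = [[1, 1, -1, 1], [1, -1, -1, 1]]
toy_labels = [1, 1, -1, 1]
toy_b, toy_eps_S, toy_eps_U = correlation_lower_bounds(toy_preds, toy_labels, n_test=100, delta=0.05)
print(toy_b, toy_eps_S, toy_eps_U)
```

With so few labeled examples the penalties dominate and the resulting entries are negative, i.e. uninformative; the bound only becomes useful once $m$ and $n$ are large relative to $\ln(2p/\delta)$.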


5. Interpretation and Discussion 


Given $\sigma$, we partition the examples $\mathbf{x}$ into three subsets, depending on the value of the ensemble
prediction: the hedged set $H(\sigma) := \{\mathbf{x} : |\mathbf{x}^\top\sigma| < 1\}$, the clipped set $C(\sigma) := \{\mathbf{x} : |\mathbf{x}^\top\sigma| > 1\}$,
and the borderline set $B(\sigma) := \{\mathbf{x} : |\mathbf{x}^\top\sigma| = 1\}$. Using these sets, we now give some intuition
regarding the optimal choice of $\mathbf{g}$ and $\mathbf{z}$ given in (4), for some fixed $\sigma$.

Consider first examples $\mathbf{x}_i$ in $H(\sigma)$. Here the optimal $g_i$ is to predict with the ensemble prediction
$\mathbf{x}_i^\top\sigma$, a number in $(-1, 1)$. Making such an intermediate prediction might seem to be a type of
calibration, but this view is misleading. The optimal strategy for the adversary in this case is to set
$z_i = 0$, equivalent to predicting $\pm 1$ with probability 1/2 each. The learner hedges
because if $g_i < \mathbf{x}_i^\top\sigma$, the adversary would respond with $z_i = 1$, and with $z_i = -1$ if $g_i > \mathbf{x}_i^\top\sigma$.
In either case, the loss of the predictor would increase. In other words, our ultimate rationale for
hedging is not calibration, but rather “defensive forecasting” in the spirit of Vovk et al. (2005).

Next we consider the clipped set, $\mathbf{x}_j \in C(\sigma)$. In this case, the adversary’s optimal strategy is
to predict deterministically, and so the learner matches the adversary here. It is interesting to note
that with all else held equal, increasing the margin $|\mathbf{x}_j^\top\sigma|$ beyond 1 is suboptimal for the learner.
Qualitatively, the reason is that while $|\mathbf{x}_j^\top\sigma|$ continues to increase, the prediction for the learner is
clipped, and so the value for the learner does not increase with the ensemble prediction.


5.1. Subgradient Conditions 

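For another perspective on the result of Theorem 2, consider the subdifferential set of the slack
function $\gamma$ at an arbitrary weighting $\sigma$:

$$\partial\gamma(\sigma) = \left\{ \frac{1}{n}\left( \sum_{j \in C(\sigma)} \mathbf{x}_j\, \mathrm{sgn}(\mathbf{x}_j^\top\sigma) + \sum_{j \in B(\sigma)} c_j\, \mathbf{x}_j\, \mathrm{sgn}(\mathbf{x}_j^\top\sigma) \right) - \mathbf{b}, \quad \forall c_j \in [0, 1] \right\} \qquad (5)$$

Note that the hedged set plays no role in $\partial\gamma(\sigma)$. Since the slack function $\gamma(\cdot)$ is convex (Lemma
11), the subdifferential set (5) at any optimal weighting $\sigma^*$ contains $\mathbf{0}$, i.e.,

$$\exists\, c_j \in [0, 1] \ \text{s.t.} \quad n\mathbf{b} - \sum_{j : \mathbf{x}_j^\top\sigma^* > 1} \mathbf{x}_j + \sum_{j : \mathbf{x}_j^\top\sigma^* < -1} \mathbf{x}_j = \sum_{j : |\mathbf{x}_j^\top\sigma^*| = 1} c_j\, \mathbf{x}_j\, \mathrm{sgn}(\mathbf{x}_j^\top\sigma^*) \qquad (6)$$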
Figure 2: An illustration of the optimal $\sigma^* \ge \mathbf{0}^p$. The vector $n\mathbf{b}$ is the difference between the sums of
two categories of clipped examples: those with high ensemble prediction ($\mathbf{x}^\top\sigma^* > 1$) and low
prediction ($< -1$). The effect of $B(\sigma^*)$ is neglected for simplicity.

The geometric interpretation of this equation is given in Figure 2. The optimal weighting $\sigma^*$
partitions the examples into five sets: hedged, positive borderline and positive clipped, and negative
borderline and negative clipped. Taking the difference between the sum of the positive clipped
and the sum of the negative clipped examples gives a vector that is approximately $n\mathbf{b}$. By adding a
weighted sum of the borderline examples, $n\mathbf{b}$ can be obtained exactly.

5.2. Beating ERM Without Clipping 

We now make some brief observations about the minimax solution. 

First, note that no $\sigma$ such that $\|\sigma\|_1 < 1$ can be optimal, because in such a case $-\gamma\left(\sigma / \|\sigma\|_1\right) > -\gamma(\sigma)$;
therefore, $\|\sigma^*\|_1 \ge 1$.

Next, suppose we do not know the matrix $\mathbf{F}$. Then $\|\sigma^*\|_1 = 1$. This can be shown by proving
the contrapositive. Assuming the negation $1 < \|\sigma^*\|_1 := a$, there exists a vector $\mathbf{x} \in [-1, 1]^p$ such
that $\mathbf{x}^\top\sigma^* = \|\sigma^*\|_1 > 1$. If each of the columns of $\mathbf{F}$ is equal to $\mathbf{x}$, then by definition of the slack
function, $-\gamma(\sigma^*/a) > -\gamma(\sigma^*)$, so $\sigma^*$ cannot be optimal.

In other words, if we want to protect ourselves against the worst case $\mathbf{F}$, then we have to set
$\|\sigma\|_1 = 1$ so as to ensure that $C(\sigma)$ is empty. In this case, the slack function simplifies to $\gamma(\sigma) = -\mathbf{b}^\top\sigma$,
over the probability simplex. Minimizing this is achieved by setting $\sigma_i$ to be 1 at $\arg\max_{i \in [p]} b_i$
and zero elsewhere. So as might be expected, in the case that $\mathbf{F}$ is unknown, the optimal strategy is
simply to use the classifier with the best error guarantee.
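
The simplification used in this argument can be spelled out in one line. The following worked step is a restatement under the assumptions already in force (every entry of $\mathbf{F}$ lies in $[-1, 1]$, $\sigma \ge \mathbf{0}^p$, and $\|\sigma\|_1 = 1$); it relies only on Hölder's inequality and on the fact that a linear function over the simplex is optimized at a vertex.

```latex
|\mathbf{x}_j^\top \sigma| \le \|\mathbf{x}_j\|_\infty \, \|\sigma\|_1 \le 1
\;\Longrightarrow\;
\big[\, |\mathbf{x}_j^\top \sigma| - 1 \,\big]_+ = 0 \quad \forall j \in [n]
\;\Longrightarrow\;
\gamma(\sigma) = -\mathbf{b}^\top \sigma ,
\qquad
\min_{\sigma \in \Delta^p} \big( -\mathbf{b}^\top \sigma \big) = -\max_{i \in [p]} b_i .
```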

This is true because $C(\sigma^*)$ is empty, and the set of all $\sigma$ such that $C(\sigma)$ is empty is of wider
interest. We dub it the Zero Box Region: $\mathrm{ZBR} = \{\sigma : C(\sigma) = \emptyset\}$. Another clean characterization
of the ZBR can be made by using a duality argument similar to that used to prove Theorem 2.
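Theorem 3 The best weighting in ZBR satisfies

$$\max_{\substack{\sigma \in \mathrm{ZBR}, \\ \sigma \ge \mathbf{0}^p}} \mathbf{b}^\top\sigma = \max_{\mathbf{g} \in [-1,1]^n} \ \min_{\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \ \frac{1}{n}\mathbf{z}^\top\mathbf{g} .$$

In particular, the optimal $\sigma^* \in \mathrm{ZBR}$ if and only if the hypercube constraint $\mathbf{z} \in [-1, 1]^n$ is superfluous,
i.e. when $V = \min_{\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \ \max_{\mathbf{g} \in [-1,1]^n} \ \frac{1}{n}\mathbf{z}^\top\mathbf{g}$.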


The ZBR is where the optimal strategy is always to hedge and never to incorporate any clipping.
Consider a situation in which the solution is in ZBR, $\sigma^* = \mathbf{1}^p$, and all of the predictions are binary:
$\mathbf{F} \in \{-1, +1\}^{p \times n}$. This is an ideal case for our method; instead of the baseline value $\max_i b_i$
obtained when $\mathbf{F}$ is unknown, we get a superior value of $\sum_i b_i$.

In fact, we referred to such a case in the introduction, and we present a formal version here.
Take $p$ to be odd and suppose that $n = p$. Then set $\mathbf{F}$ to be a matrix where each row (classifier) and
each column (example) contains $(p + 1)/2$ entries equal to $+1$ and $(p - 1)/2$ entries equal to $-1$.$^3$
Finally choose an arbitrary subset of the columns (to have true label $-1$), and invert all their entries.

In this setup, all classifiers (rows) have the same error: $\mathbf{b} = \frac{1}{p}\mathbf{1}^p$. The optimal weight vector
in this case is $\sigma^* = \mathbf{1}^p$, the solution is in ZBR because $|\mathbf{x}^\top\sigma^*| = 1 \ \forall \mathbf{x}$, and the minimax value is
$V = 1$, which corresponds to zero error. Any single rule has an error of $\frac{1}{2} - \frac{1}{2p}$, so using $\mathbf{F}$ with $p$
classifiers has led to a $p$-fold improvement over random guessing!
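
A small numerical check of this construction is sketched below. It is one possible instantiation, not the one from the text: the circulant layout used to balance every row and column, the helper name `diverse_ensemble`, and the choice $p = 5$ with two negated columns are all assumptions of this example.

```python
def diverse_ensemble(p, negative_columns):
    """Circulant +/-1 matrix in which every row and every column has
    (p+1)/2 entries equal to +1; the columns listed in negative_columns
    are then negated, and those examples receive true label -1."""
    assert p % 2 == 1
    pattern = [1] * ((p + 1) // 2) + [-1] * ((p - 1) // 2)
    F = [[pattern[(j - i) % p] for j in range(p)] for i in range(p)]
    z = [-1 if j in negative_columns else 1 for j in range(p)]
    F = [[F[i][j] * z[j] for j in range(p)] for i in range(p)]
    return F, z

p = 5
F, z = diverse_ensemble(p, negative_columns={1, 3})
# Test correlation of each classifier with the true labels: (1/n) F z.
b = [sum(F[i][j] * z[j] for j in range(p)) / p for i in range(p)]
# Ensemble predictions under the uniform weighting sigma = 1^p.
votes = [sum(F[i][j] for i in range(p)) for j in range(p)]
# -gamma(sigma) = b^T sigma - (1/n) sum_j [|vote_j| - 1]_+ with sigma = 1^p.
value = sum(b) - sum(max(0, abs(v) - 1) for v in votes) / p
print(b, votes, value)
```

Every classifier has correlation $1/p$, every ensemble prediction equals the true label, and the value $-\gamma(\mathbf{1}^p)$ equals 1, matching the zero-error claim for this instance.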

Of course, this particular case is extremal in some ways; in order to be in ZBR, there must 
be many cancellations in $\mathbf{F}^\top\sigma^*$. This echoes the common heuristic belief that, when combining an
ensemble of classifiers, we want the classifiers to be “diverse” (e.g. Kuncheva and Whitaker (2003)). 
The above example in fact has the maximal average disagreement between pairs of classifiers for a 
fixed p. Similar results hold if F is constructed using independent random draws. 

So our formulation recovers ERM without knowledge of F, and can recover an (unweighted) 
majority vote in cases where this provides dramatic performance improvements. The real algorithmic
benefit of our unified formulation is in automatically interpolating between these extremes.

To illustrate, suppose F is given in Table 1, in which there are six classifiers partitioned into two 
blocs, and six equiprobable test examples. Here, it can be seen that the true labeling must be + on 
all examples. 



          A classifiers          B classifiers
        h_1    h_2    h_3      h_4    h_5    h_6
x_1      -      +      +        +      +      +
x_2      -      +      +        +      +      +
x_3      +      -      +        +      +      +
x_4      +      -      +        +      +      +
x_5      +      +      -        +      +      +
x_6      +      +      -        -      -      -
b       1/3    1/3    1/3      2/3    2/3    2/3

Table 1: Example with two classifier blocs.

3. For instance, by setting $F_{ij} = 1$ if $(i + j)$ is even, and $-1$ otherwise.
In this situation, the best single rule errs on $x_6$, as does an (unweighted majority) vote over the
six classifiers, and even a vote over just the better-performing “B” classifiers. But a vote over the
“A” rules makes no errors, and our algorithm recovers it with a weighting of $\sigma^* = (1; 1; 1; 0; 0; 0)$.
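
The claims about Table 1 can be verified numerically. The sketch below is illustrative: it transcribes the table with classifiers as rows (an assumed layout), and the helper names `margins` and `neg_slack` are hypothetical. It compares the value $-\gamma(\sigma)$ of a vote over the “A” bloc with that of the best single rule and of an unweighted vote over all six classifiers.

```python
# Rows are classifiers h_1..h_6, columns are examples x_1..x_6 (Table 1 transposed).
F = [
    [-1, -1, +1, +1, +1, +1],  # h_1
    [+1, +1, -1, -1, +1, +1],  # h_2
    [+1, +1, +1, +1, -1, -1],  # h_3
    [+1, +1, +1, +1, +1, -1],  # h_4
    [+1, +1, +1, +1, +1, -1],  # h_5
    [+1, +1, +1, +1, +1, -1],  # h_6
]
b = [1 / 3, 1 / 3, 1 / 3, 2 / 3, 2 / 3, 2 / 3]

def margins(F, sigma):
    """Ensemble predictions F^T sigma, one per example."""
    p, n = len(F), len(F[0])
    return [sum(F[i][j] * sigma[i] for i in range(p)) for j in range(n)]

def neg_slack(F, b, sigma):
    """-gamma(sigma) = b^T sigma - (1/n) sum_j [|x_j^T sigma| - 1]_+."""
    m = margins(F, sigma)
    penalty = sum(max(0.0, abs(v) - 1.0) for v in m) / len(m)
    return sum(bi * si for bi, si in zip(b, sigma)) - penalty

sigma_A = [1, 1, 1, 0, 0, 0]       # vote over the "A" bloc only
sigma_single = [0, 0, 0, 1, 0, 0]  # best single rule (h_4)
sigma_all = [1, 1, 1, 1, 1, 1]     # unweighted vote over all six

print(margins(F, sigma_A), neg_slack(F, b, sigma_A))
print(margins(F, sigma_all), neg_slack(F, b, sigma_all))
print(neg_slack(F, b, sigma_single))
```

The “A” weighting gives ensemble prediction $+1$ on every example and value 1, while the six-classifier vote is negative on $x_6$ and attains a strictly smaller value.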


5.3. Approximate Learning 

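Another consequence of our formulation is that predictions of the form $\mathbf{g}(\sigma)$ are closely related to
dual optima and the slack function. Indeed, by definition of $\mathbf{g}(\sigma)$, the slack function value
$-\gamma(\sigma) = \mathbf{b}^\top\sigma - \frac{1}{n}\|\mathbf{F}^\top\sigma - \mathbf{g}(\sigma)\|_1 \le \max_{\sigma' \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma' - \frac{1}{n}\|\mathbf{F}^\top\sigma' - \mathbf{g}(\sigma)\|_1 \right]$, which is simply the dual
problem (Lemma 7) of the worst-case correlation suffered by $\mathbf{g}(\sigma)$:
$\min_{\mathbf{z} \in [-1,1]^n,\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \frac{1}{n}\mathbf{z}^\top[\mathbf{g}(\sigma)]$. We now state this formally.

Observation 4 For any weight vector $\sigma \ge \mathbf{0}^p$, the worst-case correlation after playing $\mathbf{g}(\sigma)$ is
bounded by

$$\min_{\substack{\mathbf{z} \in [-1,1]^n, \\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}}} \frac{1}{n}\mathbf{z}^\top[\mathbf{g}(\sigma)] \ \ge\ -\gamma(\sigma)$$

Observation 4 shows that convergence guarantees for optimizing the slack function directly
imply error guarantees on predictors of the form $\mathbf{g}(\sigma)$, i.e. prediction rules of the form in Fig. 1.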


5.4. Independent Label Noise 


An interesting variation on the game is to limit the adversary to $z_i \in [-\alpha_i, \alpha_i]$ for some
$\alpha = (\alpha_1; \dots; \alpha_n) \in [0, 1)^n$. This corresponds to assuming a level $1 - \alpha_i$ of independent label noise
on example $i$: the adversary is not allowed to set the label deterministically, but is forced to flip
example $i$’s label independently with probability $\frac{1}{2}(1 - \alpha_i)$.

Solving the game in this case gives the result (proof in appendices) that if we know some of the 
ensemble’s errors to be through random noise, then we can find a weight vector a that would give 
us better performance than without such information. 


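Proposition 5 (Independent Label Noise)

$$\max_{\mathbf{g} \in [-1,1]^n} \ \min_{\substack{-\alpha \le \mathbf{z} \le \alpha, \\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}}} \frac{1}{n}\mathbf{z}^\top\mathbf{g} = \max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\sum_{j=1}^{n} \alpha_j \left[\, |\mathbf{x}_j^\top\sigma| - 1 \,\right]_+ \right] \ \ge\ \max_{\sigma \ge \mathbf{0}^p} [-\gamma(\sigma)] = V$$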

Our prediction tends to clip - predict with the majority vote - more on examples with more known 
random noise, because it gains in minimax correlation by doing so. This mimics the Bayes-optimal 
classifier, which is always a majority vote. 

Indeed, this statement’s generalization to the asymmetric-noise case can be understood with 
precisely the same intuition. The sign of the majority vote affects the clipping penalty in the same 
way: 


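Proposition 6 (Asymmetric Label Noise) For some $\mathbf{l}, \mathbf{u} \ge \mathbf{0}^n$,

$$\max_{\mathbf{g} \in [-1,1]^n} \ \min_{\substack{-\mathbf{l} \le \mathbf{z} \le \mathbf{u}, \\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}}} \frac{1}{n}\mathbf{z}^\top\mathbf{g} = \max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\sum_{j=1}^{n} \left( u_j \left[\, \mathbf{x}_j^\top\sigma - 1 \,\right]_+ + l_j \left[\, -\mathbf{x}_j^\top\sigma - 1 \,\right]_+ \right) \right] \ \ge\ V$$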
6. Computational Issues


The learning algorithm we presented has two steps. The first is efficient and straightforward: $\mathbf{b}$ is
calculated by simply averaging over training examples to produce $\widehat{\mathrm{corr}}_S(h_i)$.

So our ability to produce $\mathbf{g}^*$ is dependent on our ability to find the optimal weighting $\sigma^*$ by
minimizing the slack function $\gamma(\sigma)$ over $\sigma \ge \mathbf{0}^p$. Note that typically $p \ll n$, and so it is a great
computational benefit in this case that the optimization is in the dual.

We discuss two approaches to minimizing the slack function. The most straightforward approach
is to treat the problem as a linear programming problem and use an LP solver. The main
problem with this approach is that it requires storing all of the examples in memory. As unlabeled
examples are typically much more plentiful than labeled examples, this approach could be practically
infeasible without further modification.

A different approach that exploits the structure of the equilibrium uses stochastic gradient descent
(SGD). The fact that the slack function is convex guarantees that this approach will converge
to the global minimum. The convergence rate might be suboptimal, particularly near the intersections
of hyperplanes in the piecewise-linear slack function surface. But the fact that SGD is a
constant-memory algorithm is very attractive.
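
One way such a procedure might look is sketched below. This is an assumed instantiation rather than the algorithm evaluated in the paper: it uses projected stochastic subgradient steps with step size proportional to $1/\sqrt{t}$ and returns the running average of the iterates. The function names, the step-size scale, the iteration count, and the toy data are all hypothetical; in a true streaming setting the examples would be read one at a time instead of sampled from a list.

```python
import math
import random

def neg_slack(columns, b, sigma):
    """-gamma(sigma), with each example given as a column vector x_j."""
    penalty = sum(max(0.0, abs(sum(xi * si for xi, si in zip(x, sigma))) - 1.0)
                  for x in columns) / len(columns)
    return sum(bi * si for bi, si in zip(b, sigma)) - penalty

def sgd_slack(columns, b, steps=5000, scale=0.5, seed=0):
    """Projected stochastic subgradient descent on the slack function."""
    rng = random.Random(seed)
    p = len(b)
    sigma = [0.0] * p
    avg = [0.0] * p
    for t in range(1, steps + 1):
        x = rng.choice(columns)  # one unlabeled example per step
        margin = sum(xi * si for xi, si in zip(x, sigma))
        # Subgradient of [|x^T sigma| - 1]_+ is sign(margin) * x if clipped, else 0.
        s = 0.0 if abs(margin) <= 1 else math.copysign(1.0, margin)
        eta = scale / math.sqrt(t)
        # Step against the stochastic subgradient s*x - b, then project onto sigma >= 0.
        sigma = [max(0.0, si - eta * (s * xi - bi))
                 for si, xi, bi in zip(sigma, x, b)]
        # Running average of the iterates, kept in O(p) memory.
        avg = [a + (si - a) / t for a, si in zip(avg, sigma)]
    return avg

# Hypothetical toy instance: three classifiers, three examples.
toy_columns = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]
toy_b = [1 / 3, 1 / 3, 1 / 3]
toy_sigma = sgd_slack(toy_columns, toy_b)
toy_value = neg_slack(toy_columns, toy_b, toy_sigma)
print(toy_sigma, toy_value)
```

Only the current weight vector and its running average are stored, so memory use is independent of $n$; on this toy instance the returned value approaches the optimum $V = 1$.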

Indeed, the arsenal of stochastic convex optimization methods comes into play theoretically and
practically. The slack function is a sum of i.i.d. random variables, and has a natural limiting object
$\mathbb{E}_{\mathbf{x}}\left[ \left[\, |\mathbf{x}^\top\sigma| - 1 \,\right]_+ \right] - \mathbf{b}^\top\sigma$ amenable to standard optimization techniques.


7. Related Work 

Our duality-based formulation would incorporate constraints far beyond the linear ones we have 
imposed so far, since all our results hold essentially without change in a general convex analysis 
context. Possible extensions in this vein include other loss functions as in multiclass and abstaining 
settings, specialist experts, and more discussed in the next section. 

Weighted majority votes are a nontrivial ensemble aggregation method that has received focused 
theoretical attention for classification. Of particular note is the literature on boosting for forming 
ensembles, in which the classic work of Schapire et al. (1998) shows general bounds on the error
of a weighted majority vote $e_{\mathrm{WMV}}(\sigma)$ under any distribution $\sigma$, based purely on the distribution of
a version of the margin on labeled data.

Our worst-case formulation here gives direct bounds on (expected) test error $e_{\mathrm{WMV}}(\sigma)$ as well,
since in our transductive setting, these are equivalent to lower bounds on the slack function value by
Observation 4. As we have abstracted away the labeled data information into $\mathbf{b}$, our results depend
only on $\mathbf{b}$ and the distribution of margins $|\mathbf{x}^\top\sigma|$ among the unlabeled data. Interestingly, Amini
et al. (2009) take a related approach to prove bounds on $e_{\mathrm{WMV}}(\sigma)$ in a transductive setting, as a
function of the average ensemble error and the test data margin distribution; but their budgeting
is looser and purely deals with majority votes, in contrast to our $\mathbf{g}$ in a hypercube. The transductive
setting has general benefits for averaging-based bounds also (Blum and Langford (2003)). 

One class of philosophically related methods to ours uses moments of labeled data in the statistical
learning setting to find a minimax optimal classifier; notably among linear separators (Lanckriet
et al. (2001)) and conditional label distributions under log loss (Liu and Ziebart (2014)). Our formulation
instead uses only one such moment and focuses on unlabeled data, and is thereby more
efficiently able to handle a rich class of dependence structure among classifier predictions, not just
low-order moments.

There is also a long tradition of analyzing worst-case binary prediction of online sequences, 
from which we highlight Feder et al. (1992), which shows universal optimality for bit prediction of 
a piecewise linear function similar to Fig. 1. The work of Cesa-Bianchi et al. (1993) demonstrated 
this to result in optimal prediction error in the experts setting as well, and similar results have been 
shown in related settings (Vovk (1990); Andoni and Panigrahy (2013)). 

Our emphasis on the benefit of considering global effects (our transductive setting) even when 
data are i.i.d. is in the spirit of the idea of shrinkage, well known in statistical literature since at least 
the James-Stein estimator (Efron and Morris (1977)). 

8. Conclusions and Open Problems 

In this paper we have given a new method of utilizing unlabeled examples when combining an 
ensemble of classifiers. We showed that in some cases, the performance of the combined classifiers 
is guaranteed to be much better than that of any of the individual rules. 

We have also shown that the optimal solution is characterized by a convex function we call the 
slack function. Minimizing this slack function is computationally tractable, and can potentially be 
solved in a streaming model using stochastic gradient descent. The analysis introduced an ensemble
prediction $\mathbf{x}^\top\sigma$ similar to the margin used in support vector machines. Curiously, the goal of the
optimization problem is to minimize, rather than maximize, the number of examples with large 
margin. 

Directions we are considering for future research include: 

• Is there an algorithm that combines the convergence rate of the linear programming approach 
with the small memory requirements of SGD? 

• In problems with high Bayes error, what is the best way to leverage the generalized algorithm 
which limits the adversary to a sub-interval of $[-1, 1]$?

• Can the algorithm and its analysis be extended to infinite concept classes, and under what 
conditions can this be done efficiently? 

• Allowing the classifiers to abstain can greatly increase the representational ability of the combination.
Is there a systematic way to build and combine such “specialist” classifiers?
Appendix A. Proof of Theorem 2


The core duality argument in the proofs is encapsulated in an independently useful supporting 
lemma describing the adversary’s response to a given g. 


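Lemma 7 For any $\mathbf{g} \in [-1, 1]^n$,

$$\min_{\substack{\mathbf{z} \in [-1,1]^n, \\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}}} \frac{1}{n}\mathbf{z}^\top\mathbf{g} = \max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\left\|\mathbf{F}^\top\sigma - \mathbf{g}\right\|_1 \right]$$

Proof We have

$$\min_{\substack{\mathbf{z} \in [-1,1]^n, \\ \mathbf{F}\mathbf{z} \ge n\mathbf{b}}} \frac{1}{n}\mathbf{z}^\top\mathbf{g} = \frac{1}{n} \min_{\mathbf{z} \in [-1,1]^n} \ \max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{z}^\top\mathbf{g} - \sigma^\top(\mathbf{F}\mathbf{z} - n\mathbf{b}) \right] \qquad (7)$$

$$\overset{(a)}{=} \frac{1}{n} \max_{\sigma \ge \mathbf{0}^p} \ \min_{\mathbf{z} \in [-1,1]^n} \left[ \mathbf{z}^\top(\mathbf{g} - \mathbf{F}^\top\sigma) + n\mathbf{b}^\top\sigma \right] \qquad (8)$$

$$= \frac{1}{n} \max_{\sigma \ge \mathbf{0}^p} \left[ -\left\|\mathbf{g} - \mathbf{F}^\top\sigma\right\|_1 + n\mathbf{b}^\top\sigma \right] = \max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\left\|\mathbf{F}^\top\sigma - \mathbf{g}\right\|_1 \right] \qquad (9)$$

where (a) is by the minimax theorem.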

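Now $\mathbf{g}^*$ and $V$ can be derived.

Lemma 8 If $\sigma^*$ is defined as in Theorem 2, then for every $i \in [n]$,

$$g_i^* = \begin{cases} [\mathbf{F}^\top\sigma^*]_i & |[\mathbf{F}^\top\sigma^*]_i| < 1 \\ \mathrm{sgn}([\mathbf{F}^\top\sigma^*]_i) & \text{otherwise} \end{cases}$$

Also, the value of the game (2) is $V$, as defined in Theorem 2.

Proof [Proof of Lemma 8] From Lemma 7, the primal game (2) is equivalent to

$$\max_{\substack{\mathbf{g} \in [-1,1]^n, \\ \sigma \ge \mathbf{0}^p}} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\left\|\mathbf{g} - \mathbf{F}^\top\sigma\right\|_1 \right] \qquad (10)$$

From (10), it is clear that given a setting of $\sigma$, the $\mathbf{g}^*(\sigma)$ that maximizes (10) is also the one that
minimizes $\|\mathbf{g} - \mathbf{F}^\top\sigma\|_1$ under the hypercube constraint:

$$g_i^*(\sigma) = \begin{cases} [\mathbf{F}^\top\sigma]_i & |[\mathbf{F}^\top\sigma]_i| < 1 \\ \mathrm{sgn}([\mathbf{F}^\top\sigma]_i) & \text{otherwise} \end{cases}$$

The optimum $\sigma$ here is therefore

$$\sigma^* = \arg\max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\left\|\mathbf{g}^*(\sigma) - \mathbf{F}^\top\sigma\right\|_1 \right] = \arg\max_{\sigma \ge \mathbf{0}^p} \left[ \mathbf{b}^\top\sigma - \frac{1}{n}\sum_{j=1}^{n} \left[\, |\mathbf{x}_j^\top\sigma| - 1 \,\right]_+ \right] = \arg\min_{\sigma \ge \mathbf{0}^p} [\gamma(\sigma)]$$

which finishes the proof.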
A.1. Derivation of $\mathbf{z}^*$

We can now derive $\mathbf{z}^*$ from Lagrange complementary slackness conditions.

Lemma 9 If $\sigma^*$ is defined as in Theorem 2, then for every $i \in [n]$,

$$z_i^* = \begin{cases} 0 & |[\mathbf{F}^\top\sigma^*]_i| < 1 \\ \mathrm{sgn}([\mathbf{F}^\top\sigma^*]_i) & |[\mathbf{F}^\top\sigma^*]_i| > 1 \end{cases}$$

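Proof We first rewrite the game slightly to make complementary slackness manipulations more
transparent.

Define $\mathbf{c} = \mathbf{1}^{2n}$, $\mathbf{A} = [\mathbf{F}, -\mathbf{F};\ -\mathbf{I}_{2n}] \in [-1, 1]^{(p+2n) \times 2n}$, and $\mathbf{B} = [n\mathbf{b};\ -\mathbf{1}^{2n}]$. This
will allow us to reparametrize the problem in terms of $\zeta = ([z_1]_+, \dots, [z_n]_+, [z_1]_-, \dots, [z_n]_-)^\top$.

Now apply the minimax theorem (Cesa-Bianchi and Lugosi (2006), Theorem 7.1) to (2) to yield
the minimax dual game:

$$\min_{\substack{\mathbf{z} \in [-1,1]^n, \\ \frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}}} \ \max_{\mathbf{g} \in [-1,1]^n} \frac{1}{n}\mathbf{z}^\top\mathbf{g} = \min_{\substack{\mathbf{z} \in [-1,1]^n, \\ \mathbf{F}\mathbf{z} \ge n\mathbf{b}}} \frac{1}{n}\|\mathbf{z}\|_1 \qquad (11)$$

With the above definitions, (11) becomes

$$\min_{\mathbf{A}\zeta \ge \mathbf{B},\ \zeta \ge \mathbf{0}} \frac{1}{n}\mathbf{c}^\top\zeta \qquad (12)$$

This is clearly a linear program (LP); its dual program, equal to it by strong LP duality (since
we assume a feasible solution exists; Vanderbei (1996)) is

$$\max_{\lambda \ge \mathbf{0},\ \mathbf{c} \ge \mathbf{A}^\top\lambda} \frac{1}{n}\mathbf{B}^\top\lambda \qquad (13)$$

Denote the solutions to (12) and (13) as $\zeta^*$ and $\lambda^*$.

By Lemma 8 and the discussion leading up to it, we already know $\sigma^*$, and therefore only need
establish the dependence of $\mathbf{z}^*$ on $\sigma^*$. Applying LP complementary slackness (Vanderbei (1996),
Thm. 5.3) to (12) and (13), we get that for all $j \in [2n]$,

$$[\mathbf{c} - \mathbf{A}^\top\lambda^*]_j > 0 \implies \zeta_j^* = 0 \qquad (14)$$

and

$$\lambda_j^* > 0 \implies [\mathbf{A}\zeta^* - \mathbf{B}]_j = 0 \qquad (15)$$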

First we examine (14). Write $\lambda^* = [\sigma^*; \lambda^*_\alpha]$, where $\lambda^*_\alpha \in \mathbb{R}^{2n}$ denotes the last $2n$ coordinates of $\lambda^*$.
The condition $[\mathbf{c} - \mathbf{A}^\top\lambda^*]_j > 0$ can be rewritten as

$$c_j + [\lambda^*_\alpha]_j - [\mathbf{F}^\top\sigma^*; -\mathbf{F}^\top\sigma^*]_j > 0$$

For any example $i \le n$, if $|[\mathbf{F}^\top\sigma^*]_i| < 1$, then $[\mathbf{c} - \mathbf{A}^\top\lambda^*]_i \ge c_i - [\mathbf{F}^\top\sigma^*; -\mathbf{F}^\top\sigma^*]_i > 0$ (since
$\lambda^*_\alpha \ge \mathbf{0}$). Similarly, $[\mathbf{c} - \mathbf{A}^\top\lambda^*]_{i+n} \ge c_{i+n} - [\mathbf{F}^\top\sigma^*; -\mathbf{F}^\top\sigma^*]_{i+n} > 0$. By (14), this means $\zeta_i^* = \zeta_{i+n}^* = 0$, which
implies $z_i^* = 0$ by definition of $\zeta^*$. So we have shown that $|[\mathbf{F}^\top\sigma^*]_i| < 1 \implies z_i^* = 0$.

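It remains only to prove that $|[\mathbf{F}^\top\sigma^*]_i| > 1 \implies z_i^* = \mathrm{sgn}([\mathbf{F}^\top\sigma^*]_i)$ for any $i \in [n]$. We
first show this is true for $[\mathbf{F}^\top\sigma^*]_i > 1$. In this case, from the constraints we need $\mathbf{c} \ge \mathbf{A}^\top\lambda^*$, in
particular that

$$0 \le [\mathbf{c} - \mathbf{A}^\top\lambda^*]_i = [\lambda^*_\alpha]_i + \left( c_i - [\mathbf{F}^\top\sigma^*]_i \right) \qquad (16)$$

By assumption, $c_i - [\mathbf{F}^\top\sigma^*]_i = 1 - [\mathbf{F}^\top\sigma^*]_i < 0$. Combined with (16), this means we must have
$[\lambda^*_\alpha]_i > 0$, i.e. $\lambda^*_{i+p} > 0$. From (15), this means that

$$[\mathbf{A}\zeta^* - \mathbf{B}]_{i+p} = 0 \implies \zeta_i^* = 1$$

Meanwhile,

$$[\mathbf{c} - \mathbf{A}^\top\lambda^*]_{i+n} = [\lambda^*_\alpha]_{i+n} + \left( c_{i+n} - [-\mathbf{F}^\top\sigma^*]_i \right) \ge c_{i+n} + [\mathbf{F}^\top\sigma^*]_i > 0$$

so from (14), $\zeta_{i+n}^* = 0$. Since $\zeta_i^* = 1$, this implies $z_i^* = 1 = \mathrm{sgn}([\mathbf{F}^\top\sigma^*]_i)$, as desired.

This concludes the proof for examples $i$ such that $[\mathbf{F}^\top\sigma^*]_i > 1$. The situation when $[\mathbf{F}^\top\sigma^*]_i < -1$
is similar, but the roles of the $i$-th and $(i + n)$-th coordinates are reversed from (16) onwards in
the above proof. ■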


By further inspection of the subgradient conditions described in the body of the paper, one can 
readily show the following result, which complements Theorem 2. 


Corollary 10 For examples $j$ such that $|\mathbf{x}_j^\top\sigma^*| = 1$,

$$z_j^* = c_j\, \mathrm{sgn}(\mathbf{x}_j^\top\sigma^*)$$

where $c_j \in [0, 1]$ are as defined in (6).


Appendix B. Miscellaneous Proofs

Lemma 11 The function $\gamma(\sigma)$ is convex in $\sigma$.

Proof Note that for each $j$, the term $\left[\, |\mathbf{x}_j^\top\sigma| - 1 \,\right]_+$ is convex in $\sigma$. Therefore, the
average of $n$ terms is convex. As the term $-\mathbf{b}^\top\sigma$ is linear, the whole expression $\gamma(\sigma)$ is convex.
(This is a special case of the Lagrangian dual function always being concave in the dual variables.) ■
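Proof [Proof of Theorem 3] Let $\mathbf{A} = [\mathbf{F}, -\mathbf{F}] \in \mathbb{R}^{p \times 2n}$ and $\mathbf{c} = \mathbf{1}^{2n}$. Then the first assertion
follows by LP duality:

$$\max_{\substack{\sigma \in \mathrm{ZBR}, \\ \sigma \ge \mathbf{0}^p}} \mathbf{b}^\top\sigma = \max_{\substack{\mathbf{A}^\top\sigma \le \mathbf{c}, \\ \sigma \ge \mathbf{0}^p}} \mathbf{b}^\top\sigma \overset{(a)}{=} \min_{\substack{\mathbf{A}\zeta \ge \mathbf{b}, \\ \zeta \ge \mathbf{0}^{2n}}} \mathbf{c}^\top\zeta = \min_{\substack{\mathbf{F}(\zeta^+ - \zeta^-) \ge \mathbf{b}, \\ \zeta^+, \zeta^- \ge \mathbf{0}^n}} \mathbf{1}^{n\top}(\zeta^+ + \zeta^-) = \min_{\mathbf{F}\mathbf{z} \ge \mathbf{b}} \|\mathbf{z}\|_1$$

$$= \min_{\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \frac{1}{n}\|\mathbf{z}\|_1 = \min_{\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \ \max_{\mathbf{g} \in [-1,1]^n} \frac{1}{n}\mathbf{z}^\top\mathbf{g} \overset{(b)}{=} \max_{\mathbf{g} \in [-1,1]^n} \ \min_{\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \frac{1}{n}\mathbf{z}^\top\mathbf{g}$$

where (a) is by strong LP duality and (b) uses the minimax theorem.

The second assertion follows because

$$V = -\gamma(\sigma^*) \overset{(c)}{=} \mathbf{b}^\top\sigma^* \overset{(d)}{=} \max_{\mathbf{g} \in [-1,1]^n} \ \min_{\frac{1}{n}\mathbf{F}\mathbf{z} \ge \mathbf{b}} \frac{1}{n}\mathbf{z}^\top\mathbf{g} ,$$

where (c) uses the definition of ZBR and (d) is due to the first assertion.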

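Proof [Proof of Prop. 5] The derivation here closely follows that of Lemma 7, except that (9) now
instead becomes

$$\frac{1}{n} \max_{\sigma \ge \mathbf{0}^p} \left[ -\sum_{j=1}^{n} \alpha_j \left| [\mathbf{g} - \mathbf{F}^\top\sigma]_j \right| + n\mathbf{b}^\top\sigma \right]$$

which is equal to the final result.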


The proof of Prop. 6 is exactly analogous to that of Prop. 5. 